feat: new eval schema changes #685

radu-mocanu · 2025-10-13T14:18:35Z

Done:

added new evaluators
support new schema
wired up platform evaluators
support for custom evaluators

Tip: check calculator sample

ToDo:

rename from coded to legacy evarywhere

How to test:
example contains evaluator spec:

 {
  "version": "1.0",
  "id": "DenialCodeContains",
  "description": "Checks if the response text includes the expected denial code.",
  "evaluatorTypeId": "uipath-contains",
  "evaluatorConfig": {
    "name": "Denial Code Contains",
    "targetOutputKey": "report",
    "negated": true,
    "ignoreCase": false,
    "defaultEvaluationCriteria": {
      "searchText": "mock report"
    }
  }
}

example eval set:

{
  "version": "1.0",
  "id": "ClaimDenialReview",
  "name": "Claim Denial Review",
  "evaluatorRefs": [
    "DenialCodeContains"
  ],
  "evaluations": [
    {
      "id": "denial-default",
      "name": "Respond with default denial code",
      "inputs": {
        "human_input": "Customer asks for the denial code on claim XFC-01."
      },
      "evaluationCriterias": {
        "DenialCodeContains": null
      }
    },
    {
      "id": "denial-override",
      "name": "Respond with override denial code",
      "inputs": {
        "human_input": "Customer asks if claim XFC-02 was denied and why."
      },
      "evaluationCriterias": {
        "DenialCodeContains": {
          "searchText": "mock sreport 1234 lala"
        }
      }
    },
    {
      "id": "denial-skip",
      "name": "Skip denial code check",
      "inputs": {
        "human_input": "Customer checks status of claim XFC-03 with no denial expected."
      },
      "evaluationCriterias": {}
    }
  ]
}

Development Package

Add this package as a dependency in your pyproject.toml:

[project]
dependencies = [
  # Exact version:
  "uipath==2.1.108.dev1006852164",

  # Any version from PR
  "uipath>=2.1.108.dev1006850000,<2.1.108.dev1006860000"
]

[[tool.uv.index]]
name = "testpypi"
url = "https://test.pypi.org/simple/"
publish-url = "https://test.pypi.org/legacy/"
explicit = true

[tool.uv.sources]
uipath = { index = "testpypi" }

…issues Address Copilot comments for coded evals

feat: wiring ExactMatch evaluator to new schema

feat: wiring JsonSimilarity evaluator to new schema

- Add version property detection to distinguish coded-evals from legacy files - Update pull command to map coded-evals folder to local evals structure - Update push command to upload files with version property to coded-evals folder - Maintain backward compatibility with legacy evals folder structure - Ensure eval command works out of the box with existing structure fix: resolve eval set path for correct evaluator discovery - Update load_eval_set to return both evaluation set and resolved path - Fix evaluator discovery by using resolved path instead of original path - Ensure eval command works with files in evals/eval-sets/ and evals/evaluators/ fix: cleaning up files fix: address PR review comments 1. Move eval_set path resolution from runtime to CLI layer - Resolve path in cli_eval.py before creating runtime - Remove context update in runtime since path is already resolved - Better separation of concerns 2. Clarify directory structure comments - Make it explicit that os.path.join produces {self.directory}/evals/evaluators/ - Prevent confusion about directory paths 3. Add file deletion consistency for evaluation files - Delete remote evaluation files when deleted locally - Matches behavior of source file handling - Ensures consistency across all file types Addresses: #681 (review) Addresses: #681 (review) Addresses: #681 (comment)

feat: adding pull and push for coded-evals folder files

feat: wiring LLM judge evaluators to new schema

…_evals feat: Cherry-pick progress on parallelization of eval runs

feat: missing changes from llm eval wiring

feat: wire up trajectory eval

Add dedicated environment variable for eval endpoint routing with environment-aware localhost detection using proper URL parsing, avoiding false positives and impact to other services using UIPATH_URL. Added `UIPATH_EVAL_BACKEND_URL` for eval-specific routing: - Set to localhost URL (e.g., `http://localhost:8080`) for local development - Leave unset or set to production URL for alpha/production environments - Isolates eval endpoint routing from UIPATH_URL used by other services **New Constant:** - `ENV_EVAL_BACKEND_URL = "UIPATH_EVAL_BACKEND_URL"` in `constants.py` **Updated Helper Method with Robust URL Parsing:** - `_get_endpoint_prefix()` uses `urllib.parse.urlparse()` for accurate hostname detection - Checks parsed hostname specifically (not substring matching) - Prevents false positives like "notlocalhost.com" or "127.0.0.1.example.com" - Returns `""` (empty) only when hostname is exactly "localhost" or "127.0.0.1" - Returns `"agentsruntime_/"` for all other cases (including unset or parse failures) - Handles edge cases: case-insensitive matching, ports, protocols All 4 progress reporting endpoints now use `/coded/` path with conditional routing: | Method | Endpoint Pattern | |--------|------------------| | PUT evalRun | `{prefix}api/execution/agents/{id}/coded/evalRun` | | POST evalRun | `{prefix}api/execution/agents/{id}/coded/evalRun` | | POST evalSetRun | `{prefix}api/execution/agents/{id}/coded/evalSetRun` | | PUT evalSetRun | `{prefix}api/execution/agents/{id}/coded/evalSetRun` | Where `{prefix}` is determined by `_get_endpoint_prefix()`: - Localhost: `""` (empty - direct API access) - Alpha/Prod: `"agentsruntime_/"` (service routing) - `_update_eval_run_spec()` - Update eval run with results - `_create_eval_run_spec()` - Create new eval run - `_create_eval_set_run_spec()` - Create new eval set run - `_update_eval_set_run_spec()` - Update eval set run completion ```bash export UIPATH_EVAL_BACKEND_URL=http://localhost:8080 ``` ```bash ``` ```bash export UIPATH_EVAL_BACKEND_URL=https://alpha.uipath.com ``` ✅ **Isolated Configuration:** - Eval routing independent of UIPATH_URL - Other services using UIPATH_URL remain unaffected - Enables local eval testing without affecting other components ✅ **Robust URL Parsing:** - Uses `urllib.parse.urlparse()` for accurate hostname extraction - Prevents false positives from substring matching (e.g., "notlocalhost.com") - Handles edge cases: ports, protocols, case sensitivity - Graceful fallback on parsing errors ✅ **Simple & Explicit:** - Single environment variable controls all eval endpoint routing - Clear localhost detection (exact hostname match) - Defaults to production routing when unset (safe fallback) ✅ **Backward Compatible:** - No breaking changes to existing deployments - Defaults to `agentsruntime_/` prefix when env not set - Supports new `/coded/` evaluator API endpoints ✅ **Validation:** - Syntax validation: Passed - Logic verification: Passed (13 test scenarios including edge cases) - Files changed: 2 modified (+29, -5 lines) ✅ **Environment Detection Tests:** **Standard Cases:** - `http://localhost:8080` → ✓ Empty prefix - `http://127.0.0.1:3000` → ✓ Empty prefix - `https://localhost` → ✓ Empty prefix - `https://alpha.uipath.com` → ✓ `agentsruntime_/` prefix - `https://cloud.uipath.com` → ✓ `agentsruntime_/` prefix - Unset/empty → ✓ `agentsruntime_/` prefix (default) **Edge Cases (False Positive Prevention):** - `https://notlocalhost.com` → ✓ `agentsruntime_/` prefix (not localhost) - `https://127.0.0.1.example.com` → ✓ `agentsruntime_/` prefix (not localhost) - `https://mylocalhost.io` → ✓ `agentsruntime_/` prefix (not localhost) **Case Sensitivity:** - `http://LOCALHOST:8080` → ✓ Empty prefix (case-insensitive) - `http://LocalHost:8080` → ✓ Empty prefix (case-insensitive) **Problem:** Simple substring check `"localhost" in url` could match `"notlocalhost.com"` **Solution:** Use `urlparse()` to extract exact hostname, preventing false positives **Status:** Verified - documentation correctly references `UIPATH_EVAL_BACKEND_URL` throughout - `src/uipath/_cli/_evals/_progress_reporter.py` (+29, -5) - Added `urllib.parse.urlparse` import - Improved `_get_endpoint_prefix()` with URL parsing logic - `src/uipath/_utils/constants.py` (+1) - Added `ENV_EVAL_BACKEND_URL` constant 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

feat: wire tool evals, add mocked tool sample agent

refac: coded_evalutors -> evaluators, associated renames/changes

feat(evals): add dedicated UIPATH_EVAL_BACKEND_URL for localhost routing

feat: add support for custom evaluators

akshaylive

TBH this PR is way too big to review in one shot. If the individual PRs have already been approved, please merge it as-is, and address my comments in a subsequent PR please.

akshaylive · 2025-10-23T03:52:38Z

src/uipath/_cli/_evals/_progress_reporter.py

+            assertion_runs, evaluator_scores = self._collect_results(
+                sw_progress_item.eval_results, evaluators, spans or []  # type: ignore
+            )
+            spec = self._update_eval_run_spec(


Nit:

evaluator_runs, evaluator_scores = ..

This way, you can remove the duplicate lines for spec = ...

akshaylive · 2025-10-23T03:58:15Z

src/uipath/_cli/_evals/_models/_evaluation_set.py

+    )
+
+
+class LegacyEvaluationItem(BaseModel):


Please extend EvaluationItem. We can also get rid of AnyEvaluationItem.

akshaylive · 2025-10-23T04:00:47Z

src/uipath/_cli/_evals/_models/_evaluation_set.py

+        self.evaluations = selected_evals
+
+
+class LegacyEvaluationSet(BaseModel):


Please extend EvaluationSet.

akshaylive · 2025-10-23T04:06:44Z

src/uipath/_cli/_evals/_progress_reporter.py

+        """
+        if not evaluators:
+            return False
+        # Check the first evaluator type


Why only first?

akshaylive · 2025-10-23T04:08:51Z

src/uipath/_events/_events.py

    no_of_evals: int
-    evaluators: List[Any]
+    # skip validation to avoid abstract class instantiation
+    evaluators: SkipValidation[List[AnyEvaluator]]


EvalSetRunCreatedEvent should not be a BaseModel IMO. Please switch to @dataclass instead..

akshaylive · 2025-10-23T04:11:34Z

src/uipath/_cli/_evals/_progress_reporter.py

            )
        return assertion_runs, evaluator_scores_list

+    def _collect_coded_results(


Let's avoid duplication of code? High overlap between this and _collect_results.

akshaylive · 2025-10-23T04:14:28Z

src/uipath/_cli/_evals/_models/_evaluation_set.py

+    Discriminator(_discriminate_eval_set),
+]
+
+AnyEvaluationItem = Union[EvaluationItem, LegacyEvaluationItem]


Let's get rid of these two Any types..

Hmm, curious though what downside do you see with this ?

My main issue is that currently LegacyEvaluationItem does not extend EvaluationItem but it semantically should. Any* for these purposes is an anti-pattern IMO.

src/uipath/_cli/_evals/_runtime.py

akshaylive · 2025-10-23T04:25:53Z

src/uipath/eval/evaluators/base_evaluator.py


-        result.evaluation_time = execution_time
-        return result
+class BaseEvaluator(BaseModel, Generic[T, C, J], ABC):


Please use @dataclass from dataclasses instead of pydantic since types are not serializable.

Regardless, this base class is overly complex. Have we over-engineered this?

CC @andrei-rusu

src/uipath/agent/models/agent.py

These are required for build.

fix(TonOfFixes): lots of minor fixes

Model default is provided so updating the unit test.

ionmincu · 2025-10-24T08:56:28Z

testcases/tools-evals/src/assert.py

@@ -0,0 +1 @@
+print("to be implemented later")


dont forget about this
You can check the output file, see the logs

Printing output file... === OUTPUT FILE CONTENT === { "output": { "evaluationSetName": "Weather Tools Agent Evaluation", "evaluationSetResults": [

then assert the json is like a schema or contains some known properties like `"evaluatorName": "ToolCallArgsEvaluator" with result and a score.

andrei-rusu and others added 4 commits October 10, 2025 16:54

add initial version of revamped coded evaluators

2e24fe0

fix copilot and linting issues

0b44a7c

Merge pull request #677 from UiPath/fix/andreiru/coded_evals_copilot_…

1b016c1

…issues Address Copilot comments for coded evals

feat: new eval schema support + contain evaluator wiring

333821b

radu-mocanu force-pushed the release/revamped-evals branch from a3e9908 to 333821b Compare October 13, 2025 14:46

mjnovice and others added 9 commits October 14, 2025 16:21

feat: wiring ExactMatch evaluator to new schema

860c88f

Merge pull request #690 from UiPath/mj/wire-exact-match

fd324b0

feat: wiring ExactMatch evaluator to new schema

feat: wiring JsonSimilarity evaluator to new schema

74bc147

Merge pull request #692 from UiPath/mj/wire-json-similarity

ef92ef5

feat: wiring JsonSimilarity evaluator to new schema

feat: wiring LLM judge evaluators to new schema

e05bd98

Merge pull request #681 from UiPath/feat/updating-push-pull

2205d60

feat: adding pull and push for coded-evals folder files

feat: progress on parallelization of eval runs

7a2937d

Merge pull request #697 from UiPath/mj/wire-llm-as-a-judge

931a4c6

feat: wiring LLM judge evaluators to new schema

Chibionos changed the title ~~Release/revamped evals~~ feat: new eval schema changes Oct 16, 2025

Chibionos requested a review from akshaylive October 16, 2025 12:16

andrei-rusu and others added 13 commits October 16, 2025 18:23

Merge pull request #704 from UiPath/dev/andreiru/cherry_pick_parallel…

962f07d

…_evals feat: Cherry-pick progress on parallelization of eval runs

feat: missing changes from llm eval wiring

4964c66

Merge pull request #724 from UiPath/mj/missed-fixes

746c8a7

feat: missing changes from llm eval wiring

feat: wiring up trajectory evals

0db93c2

Merge pull request #721 from UiPath/mj/wire-trajectory

4553f34

feat: wire up trajectory eval

feat: wire tool evals, add mocked tool sample agent

9af6fa9

Merge pull request #735 from UiPath/mj/wire-tool

a533446

feat: wire tool evals, add mocked tool sample agent

refac: coded_evalutors -> evaluators, associated renames/changes

8898a71

Merge pull request #752 from UiPath/mj/refac-names

22356c5

refac: coded_evalutors -> evaluators, associated renames/changes

Merge pull request #740 from UiPath/fix/apply-new-reporting-api-change

f0883c6

feat(evals): add dedicated UIPATH_EVAL_BACKEND_URL for localhost routing

feat: add support for custom evaluators

88ca0e1

Merge pull request #703 from UiPath/feat/custom-evals

8957b97

feat: add support for custom evaluators

akshaylive reviewed Oct 23, 2025

View reviewed changes

github-actions bot added test:uipath-langchain Triggers tests in the uipath-langchain-python repository test:uipath-llamaindex Triggers tests in the uipath-llamaindex-python repository labels Oct 23, 2025

mjnovice force-pushed the release/revamped-evals branch 4 times, most recently from db9a265 to 7efe060 Compare October 23, 2025 10:40

mjnovice mentioned this pull request Oct 23, 2025

Can we add the execution output to the EvaluationRunResult please? I'll need it for a different project. #765

Open

mjnovice force-pushed the release/revamped-evals branch from 7efe060 to 73d8d30 Compare October 23, 2025 10:46

Merge branch 'main' into release/revamped-evals

602854e

mjnovice force-pushed the release/revamped-evals branch from 73d8d30 to 602854e Compare October 23, 2025 14:44

akshaylive added the build:dev Create a dev build from the pr label Oct 23, 2025

Merge branch 'main' into release/revamped-evals

7efe612

akshaylive reviewed Oct 23, 2025

View reviewed changes

src/uipath/agent/models/agent.py Show resolved Hide resolved

akshaylive added 2 commits October 23, 2025 11:55

Merge branch 'main' into release/revamped-evals

35fc809

Merge branch 'main' into release/revamped-evals

6cb6db5

mjnovice force-pushed the release/revamped-evals branch from 35fc809 to 6cb6db5 Compare October 23, 2025 21:02

akshaylive and others added 3 commits October 23, 2025 17:01

fix(TonOfFixes): lots of minor fixes

07d1c07

These are required for build.

Merge pull request #772 from UiPath/akshaya/revamped-eval-fix

3fab172

fix(TonOfFixes): lots of minor fixes

fix(UnitTest): model default

78fd2fb

Model default is provided so updating the unit test.

ionmincu reviewed Oct 24, 2025

View reviewed changes

Chibi Vikram added 2 commits October 24, 2025 08:50

Merge branch 'main' into release/revamped-evals

4095baf

Merge branch 'main' into release/revamped-evals

dbb414c

Chibionos approved these changes Oct 24, 2025

View reviewed changes

radu-mocanu merged commit 3396560 into main Oct 24, 2025
61 of 62 checks passed

radu-mocanu deleted the release/revamped-evals branch October 24, 2025 16:14

		self.evaluations = selected_evals


		class LegacyEvaluationSet(BaseModel):

feat: new eval schema changes #685

feat: new eval schema changes #685

Uh oh!

Conversation

radu-mocanu commented Oct 13, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Development Package

Uh oh!

akshaylive left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ionmincu Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

radu-mocanu commented Oct 13, 2025 •

edited by github-actions bot

Loading

ionmincu Oct 24, 2025 •

edited

Loading